C Mn Si Cr Ni Mo V Co Al W Cu Nb Ti B N Ms
1 0.19 1.25 0.23 0.07 0.66 0.54 0.00 0 0.00 0 0.00 0 0 0 0.000 645.00
2 1.00 0.45 0.00 1.52 3.33 0.00 0.00 0 0.00 0 0.00 0 0 0 0.000 359.00
3 0.26 0.58 0.49 1.65 0.18 0.84 0.38 0 0.00 0 0.07 0 0 0 0.000 653.00
4 0.06 1.96 0.32 0.00 0.00 0.18 0.22 0 0.02 0 0.00 0 0 0 0.005 721.00
5 1.20 1.88 0.00 0.00 0.00 0.00 0.00 0 0.00 0 0.00 0 0 0 0.000 323.00
6 0.38 0.69 0.20 0.95 1.58 0.26 0.00 0 0.00 0 0.00 0 0 0 0.000 589.75
Martensite Starting Temperature-Data Visualization & Data Exploration
Introduction to the Dataset (Martensite Starting Temperature)
This dataset contains the chemical compositions of various steels along with their Martensite starting temperatures (Ms), aiming to develop a model that predicts Ms based on steel chemistry.
Martensite is a phase where steel becomes extremely strong and can withstand high stresses, making it essential for industries like automotive to enhance crash safety. Ms is the critical temperature where steel’s internal structure changes into Martensite, and it varies with chemical composition. Currently, determining Ms requires lengthy tests using Thermogravimetry (TGA), which are prone to errors and often need repeating. A predictive model would save significant time and effort by eliminating the need for these tests.
The head of the dataset
C, Mn, Si, Cr, etc. are chemical element symbols from the Periodic Table: Carbon, Manganese, Silicon, Chromium, and so on.
Checking for missing values
[1] 0
There are no missing values.
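A minimal sketch of how such a check is commonly written in R, assuming the data sit in a data frame. The name `steel` and its toy values are hypothetical stand-ins for the real dataset; `sum(is.na(...))` prints a single count, like the `[1] 0` shown above.

```r
# Toy stand-in for the real dataset (hypothetical name and values).
steel <- data.frame(
  C  = c(0.19, 1.00, 0.26),
  Mn = c(1.25, 0.45, 0.58),
  Ms = c(645, 359, 653)
)

# Total number of missing cells across the whole data frame.
n_missing <- sum(is.na(steel))
print(n_missing)

# Per-column counts help locate missing values if the total is not zero.
missing_by_column <- colSums(is.na(steel))
print(missing_by_column)
```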
Checking the Structure of the Dataset
'data.frame': 1543 obs. of 16 variables:
$ C : num 0.19 1 0.26 0.06 1.2 0.38 0.43 0.004 0 2 ...
$ Mn: num 1.25 0.45 0.58 1.96 1.88 0.69 0.83 0.03 0 0.65 ...
$ Si: num 0.23 0 0.49 0.32 0 0.2 1.55 0.075 0 0.3 ...
$ Cr: num 0.07 1.52 1.65 0 0 ...
$ Ni: num 0.66 3.33 0.18 0 0 ...
$ Mo: num 0.54 0 0.84 0.18 0 0.26 0.4 0 0 0.5 ...
$ V : num 0 0 0.38 0.22 0 ...
$ Co: num 0 0 0 0 0 0 0 0 0 0 ...
$ Al: num 0 0 0 0.02 0 0 0 0 0 0.051 ...
$ W : num 0 0 0 0 0 0 0 0 0 0 ...
$ Cu: num 0 0 0.07 0 0 0 0 0 0 0.08 ...
$ Nb: num 0 0 0 0 0 0 0 0 0 0 ...
$ Ti: num 0 0 0 0 0 0 0 0 0 0 ...
$ B : num 0 0 0 0 0 0 0 0 0 0 ...
$ N : num 0 0 0 0.005 0 0 0 0.002 0 0 ...
$ Ms: num 645 359 653 721 323 ...
The dataset has 1543 observations and 16 variables. Ms represents the Martensite starting temperature and is the dependent variable.
The variables have continuous numeric content.
Summary statistics of the dataset
Ms ranges from 150 to 800 degrees Celsius. However, most of the data lie between 350 and 800 degrees Celsius. The model may therefore be restricted to the temperature range where more data are available (350 to 800 degrees Celsius).
Some rows have zero Carbon (C) content, which does not make sense for steel. Steel is an alloy of Iron and Carbon, so Iron without Carbon is not steel. Rows with zero C content should be removed.
Scatter plot for one element and Martensite starting temperature
Rows with zero Carbon content are identified.
Histogram of Martensite starting temperature
Filtering the dataset for temperatures between 300 to 800 degree Celsius
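The filtering steps described in this report can be sketched in base R as follows. This is an illustration under assumptions: `steel` and its values are hypothetical, the range bounds are treated as inclusive, and `filtered_dataset` and `filtered_dataset_no_C0` simply reuse the object names that appear in the model calls.

```r
# Toy stand-in for the real dataset (hypothetical name and values).
steel <- data.frame(
  C  = c(0.19, 0.00, 1.20, 0.38, 0.00),
  Mn = c(1.25, 0.00, 1.88, 0.69, 0.00),
  Ms = c(645, 218.15, 323, 589.75, 700)
)

# Keep rows whose Ms lies in the modelling range (bounds assumed inclusive).
filtered_dataset <- subset(steel, Ms >= 300 & Ms <= 800)

# Drop rows with zero Carbon, which are not steels.
filtered_dataset_no_C0 <- subset(filtered_dataset, C > 0)

print(nrow(filtered_dataset))
print(nrow(filtered_dataset_no_C0))
```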
Building GLM models
- Prior knowledge from Materials Science was used to select the important predictors. I am not sure whether this is the correct approach.
Call:
glm(formula = Ms ~ C + Ni + Mn + Mo + Si + Cr + V + Co, family = gaussian,
data = filtered_dataset)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 750.6469 3.1633 237.300 < 2e-16 ***
C -260.1772 4.4241 -58.809 < 2e-16 ***
Ni -12.4065 0.2859 -43.394 < 2e-16 ***
Mn -25.5874 1.9657 -13.017 < 2e-16 ***
Mo -7.7315 2.0623 -3.749 0.000184 ***
Si -15.6230 2.9154 -5.359 9.69e-08 ***
Cr -7.1898 0.5425 -13.254 < 2e-16 ***
V 8.3285 3.5131 2.371 0.017881 *
Co 2.0491 0.9144 2.241 0.025184 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1997.601)
Null deviance: 12075332 on 1502 degrees of freedom
Residual deviance: 2984416 on 1494 degrees of freedom
AIC: 15699
Number of Fisher Scoring iterations: 2
At first glance, V and Co appear to be weak predictors (p ≈ 0.018 and 0.025, respectively). These two are candidates for removal.
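For readers who want to reproduce this kind of summary, here is a small sketch on synthetic data (the data frame `toy` and its coefficients are invented for illustration, not taken from the dataset). It shows two things that hold for any Gaussian GLM with the default identity link: the coefficients match those of `lm()`, and the reported dispersion equals the residual deviance divided by the residual degrees of freedom.

```r
set.seed(1)
n  <- 200
C  <- runif(n, 0.05, 1.2)
Ni <- runif(n, 0, 5)
Ms <- 750 - 260 * C - 12 * Ni + rnorm(n, sd = 40)  # synthetic response
toy <- data.frame(C, Ni, Ms)

fit_glm <- glm(Ms ~ C + Ni, family = gaussian, data = toy)
fit_lm  <- lm(Ms ~ C + Ni, data = toy)

# A Gaussian GLM with identity link gives the same estimates as ordinary least squares.
print(all.equal(coef(fit_glm), coef(fit_lm)))

# The dispersion parameter is residual deviance / residual degrees of freedom.
dispersion_reported <- summary(fit_glm)$dispersion
dispersion_by_hand  <- deviance(fit_glm) / df.residual(fit_glm)
print(c(dispersion_reported, dispersion_by_hand))

# P-values used to judge weak predictors.
p_values <- summary(fit_glm)$coefficients[, "Pr(>|t|)"]
print(signif(p_values, 3))
```

The same relation can be checked against the output above: 2984416 / 1494 ≈ 1997.6, which matches the reported dispersion parameter.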
Checking residuals
Q-Q plot shows deviations for extreme values.
Residual Distribution is skewed.
Identifying Outliers
C Mn Si Cr Ni Mo V Co Al W Cu Nb
9 0.000 0.00 0.00000 0.53096 30.96549 0.00000 0.000 0.00 0.000 0.0 0.00 0
48 0.280 0.39 0.16000 2.35000 0.06000 0.06000 0.530 0.00 0.000 4.1 0.00 0
97 0.000 0.00 0.00000 0.00000 30.77361 1.36279 0.000 0.00 0.000 0.0 0.00 0
193 0.500 0.35 1.00000 0.11000 0.19000 0.50000 0.000 0.00 0.000 0.0 0.00 0
209 0.300 0.48 2.20000 10.50000 0.12000 1.00000 0.012 0.00 0.000 0.0 0.07 0
220 0.300 1.55 0.20000 0.00000 0.00000 0.28000 0.000 0.00 0.000 0.0 0.00 0
302 0.110 0.50 0.22000 0.00000 0.00000 0.56000 0.000 0.00 0.003 0.0 0.00 0
328 0.160 0.60 0.25000 0.20000 1.50000 0.05000 0.000 0.00 0.000 0.0 0.00 0
331 0.000 0.00 1.92824 0.00000 31.53985 0.00000 0.000 0.00 0.000 0.0 0.00 0
334 0.550 0.75 1.50000 0.70000 0.00000 0.00000 0.000 0.00 0.000 0.0 0.00 0
365 1.040 0.33 0.26000 1.53000 0.31000 0.01000 0.010 0.00 0.000 0.0 0.20 0
387 0.330 1.12 0.30000 0.11000 0.24000 0.04000 0.000 0.00 0.010 0.0 0.19 0
389 0.190 1.17 0.41000 0.06000 0.00000 0.00000 0.000 0.00 0.000 0.0 0.00 0
403 0.400 1.38 1.50000 0.00000 0.00000 0.80000 0.000 0.00 0.000 0.0 0.00 0
441 0.190 0.46 0.34000 7.83000 0.09000 2.02000 0.010 0.00 0.005 0.0 0.00 0
500 0.260 0.76 0.32000 1.08000 0.72000 1.25000 0.310 0.00 0.000 0.0 0.00 0
555 0.200 1.88 0.00000 0.00000 0.00000 0.00000 0.000 0.00 0.000 0.0 0.00 0
573 0.380 0.82 1.48000 0.72000 0.00000 0.77000 0.000 0.00 0.000 0.0 0.00 0
624 0.440 0.75 0.26000 1.70000 0.17000 0.08000 0.090 0.00 0.000 0.0 0.18 0
629 0.430 0.95 1.38000 1.06000 0.03000 0.10000 0.035 0.00 0.000 0.0 0.05 0
643 0.380 1.45 0.36000 0.00000 0.00000 0.76000 0.000 0.00 0.000 0.0 0.00 0
691 0.100 0.47 0.28000 1.32000 2.34000 0.00000 0.000 0.00 0.000 0.0 0.87 0
701 0.110 0.00 0.00000 0.00000 3.49000 0.00000 0.000 0.00 0.000 0.0 0.00 0
702 0.470 0.82 0.35000 1.20000 0.04000 0.00000 0.110 0.00 0.000 0.0 0.14 0
710 0.150 0.36 0.44000 2.24000 0.09000 0.85000 0.000 0.00 0.097 0.0 0.23 0
821 0.220 0.83 0.24000 0.54000 1.06000 0.51000 0.000 0.00 0.029 0.0 0.30 0
847 0.190 1.00 0.04000 0.62000 0.02000 0.00000 0.000 0.00 0.000 0.0 0.00 0
877 0.360 0.78 0.31000 0.00000 0.73000 0.49000 0.000 0.00 0.000 0.0 0.00 0
882 0.300 0.70 0.20000 0.00000 0.00000 0.00000 0.000 0.00 0.000 0.0 0.00 0
895 0.004 0.03 0.07500 0.00000 29.55000 0.00000 0.000 0.00 0.000 0.0 0.00 0
967 0.430 0.83 1.55000 0.91000 3.02000 0.40000 0.120 0.00 0.000 0.0 0.00 0
1035 0.400 1.47 0.37000 0.00000 0.00000 0.26000 0.000 0.00 0.000 0.0 0.00 0
1069 0.170 0.49 0.29000 0.18000 5.07000 0.24000 0.000 0.00 0.000 0.0 0.10 0
1091 0.330 0.74 0.23000 0.07000 3.47000 0.00000 0.000 0.00 0.000 0.0 0.00 0
1112 0.390 0.70 0.20000 1.05000 0.00000 0.00000 0.000 0.00 0.000 0.0 0.00 0
1144 0.120 0.46 0.35000 4.79000 0.20000 0.54000 0.000 0.00 0.000 0.0 0.00 0
1148 0.400 0.80 0.33000 0.00000 0.00000 0.79000 0.000 0.00 0.000 0.0 0.00 0
1166 0.400 0.75 0.27000 0.96000 0.13000 0.07000 0.060 0.00 0.000 0.0 0.20 0
1223 0.540 0.46 0.00000 0.00000 0.00000 0.00000 0.000 0.00 0.000 0.0 0.00 0
1225 0.470 0.40 1.06000 0.00000 0.00000 0.00000 0.000 0.00 0.000 0.0 0.00 0
1226 0.910 0.65 0.23000 0.60000 1.35000 0.00000 0.000 0.00 0.000 0.0 0.03 0
1242 0.870 1.78 0.29000 0.20000 0.15000 0.03000 0.000 0.00 0.000 0.0 0.00 0
1291 0.390 1.67 0.00000 0.00000 0.10000 0.00000 0.000 0.00 0.000 0.0 0.00 0
1302 0.360 1.50 0.20000 0.00000 0.00000 0.00000 0.000 0.00 0.000 0.0 0.00 0
1346 0.430 1.57 0.23000 0.12000 0.20000 0.07000 0.000 0.00 0.000 0.0 0.00 0
1398 0.300 0.40 0.30000 0.86000 3.20000 0.40000 0.000 0.00 0.000 0.0 0.00 0
1436 0.550 0.65 0.20000 0.00000 0.65000 0.00000 0.000 0.00 0.000 0.0 0.00 0
1492 0.760 0.25 0.35000 4.54000 0.00000 5.75000 2.050 0.86 0.000 6.6 0.00 0
Ti B N Ms
9 0.000 0 0.000 218.1500
48 0.000 0 0.000 673.0000
97 0.000 0 0.000 199.1500
193 0.000 0 0.000 573.0000
209 0.000 0 0.000 528.0000
220 0.000 0 0.000 628.0000
302 0.000 0 0.002 775.0000
328 0.000 0 0.000 647.0000
331 0.000 0 0.000 171.1500
334 0.000 0 0.000 551.0000
365 0.000 0 0.000 518.0000
387 0.000 0 0.000 628.0000
389 0.000 0 0.000 700.0000
403 0.000 0 0.000 591.0000
441 0.000 0 0.013 623.0000
500 0.000 0 0.000 634.0000
555 0.000 0 0.000 666.0000
573 0.000 0 0.000 607.0000
624 0.000 0 0.000 573.0000
629 0.000 0 0.000 570.0000
643 0.000 0 0.000 601.0000
691 0.046 0 0.000 673.0000
701 0.000 0 0.000 722.0000
702 0.000 0 0.000 571.5000
710 0.010 0 0.000 698.0000
821 0.000 0 0.000 673.0000
847 0.000 0 0.000 705.0000
877 0.000 0 0.000 636.0000
882 0.000 0 0.000 666.0000
895 0.000 0 0.002 249.8167
967 0.000 0 0.000 568.0000
1035 0.000 0 0.000 596.0000
1069 0.000 0 0.000 641.0000
1091 0.000 0 0.000 583.0000
1112 0.000 0 0.000 606.0000
1144 0.000 0 0.000 736.0000
1148 0.000 0 0.000 603.0000
1166 0.035 0 0.010 634.0000
1223 0.000 0 0.000 596.0000
1225 0.000 0 0.000 594.2611
1226 0.000 0 0.000 422.0000
1242 0.000 0 0.000 422.0000
1291 0.000 0 0.000 618.0000
1302 0.000 0 0.000 603.0000
1346 0.000 0 0.000 572.0000
1398 0.000 0 0.000 598.0000
1436 0.000 0 0.000 548.0000
1492 0.000 0 0.000 473.0000
Remove rows where Carbon content is zero, refit the model, and print its summary.
- The model has been updated after removing rows with zero Carbon content.
Call:
glm(formula = Ms ~ C + Ni + Mn + Mo + Si + Cr + V + Co, family = gaussian,
data = filtered_dataset_no_C0)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 759.1787 3.0472 249.140 < 2e-16 ***
C -266.4946 4.1713 -63.887 < 2e-16 ***
Ni -14.0982 0.3195 -44.121 < 2e-16 ***
Mn -29.0478 1.8747 -15.494 < 2e-16 ***
Mo -8.0919 2.0278 -3.991 6.92e-05 ***
Si -16.6972 2.7500 -6.072 1.61e-09 ***
Cr -7.5038 0.5069 -14.804 < 2e-16 ***
V 8.3673 3.4113 2.453 0.01429 *
Co 2.3044 0.8627 2.671 0.00764 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1730.735)
Null deviance: 11796080 on 1456 degrees of freedom
Residual deviance: 2506104 on 1448 degrees of freedom
AIC: 15010
Number of Fisher Scoring iterations: 2
Updating the model by removing Mo and Co and adding an interaction parameter between Carbon and Manganese.
- The model has been updated after removing weak predictors.
Call:
glm(formula = Ms ~ C * Mn + Ni + Si + Cr + V, family = gaussian,
data = filtered_dataset_no_C0)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 749.7319 3.1524 237.829 < 2e-16 ***
C -240.0692 5.6937 -42.164 < 2e-16 ***
Mn -18.6538 2.3738 -7.858 7.54e-15 ***
Ni -13.7303 0.3167 -43.358 < 2e-16 ***
Si -13.2855 2.7810 -4.777 1.96e-06 ***
Cr -8.3034 0.5069 -16.382 < 2e-16 ***
V 1.1567 3.1278 0.370 0.712
C:Mn -39.6809 6.0769 -6.530 9.09e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1701.745)
Null deviance: 11796080 on 1456 degrees of freedom
Residual deviance: 2465829 on 1449 degrees of freedom
AIC: 14984
Number of Fisher Scoring iterations: 2
Updating the model by removing V and adding an interaction term between Carbon and Nickel.
- Further revision of the model: a weak predictor is removed and a new interaction term is added.
Call:
glm(formula = Ms ~ C * Mn + C * Ni + Si + Cr, family = gaussian,
data = filtered_dataset_no_C0)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 751.5794 3.0913 243.129 < 2e-16 ***
C -231.2782 5.4214 -42.660 < 2e-16 ***
Mn -19.0683 2.3316 -8.178 6.23e-16 ***
Ni -13.0691 0.3241 -40.326 < 2e-16 ***
Si -15.1211 2.7398 -5.519 4.03e-08 ***
Cr -8.5661 0.4913 -17.436 < 2e-16 ***
C:Mn -41.8580 5.8686 -7.133 1.55e-12 ***
C:Ni -13.8302 1.9610 -7.053 2.71e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1645.423)
Null deviance: 11796080 on 1456 degrees of freedom
Residual deviance: 2384218 on 1449 degrees of freedom
AIC: 14935
Number of Fisher Scoring iterations: 2
Log Model
After reviewing the Q-Q plot, which deviates at the extreme values, and the skewed residual distribution, a log model was considered. It will be compared with the first model at the end.
Call:
glm(formula = log(Ms) ~ C * Mn + C * Ni + Si + Cr, family = gaussian,
data = filtered_dataset_no_C0)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6617940 0.0053830 1237.558 < 2e-16 ***
C -0.4182892 0.0094406 -44.307 < 2e-16 ***
Mn -0.0407894 0.0040602 -10.046 < 2e-16 ***
Ni -0.0240965 0.0005644 -42.697 < 2e-16 ***
Si -0.0245566 0.0047709 -5.147 3.01e-07 ***
Cr -0.0152132 0.0008555 -17.783 < 2e-16 ***
C:Mn -0.0698957 0.0102193 -6.840 1.17e-11 ***
C:Ni -0.0248897 0.0034148 -7.289 5.12e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.004989449)
Null deviance: 37.6380 on 1456 degrees of freedom
Residual deviance: 7.2297 on 1449 degrees of freedom
AIC: -3578
Number of Fisher Scoring iterations: 2
Performing ANOVA for the updated model.
Performing ANOVA for the Log Model
Checking leverage and influential points for both models
Investigate the Influential Points
C Mn Si Cr Ni Mo V Co Al W Cu Nb Ti B N Ms
1145 0.022 10.24 0.00 8.19 0.00 0.00 0.00 0 0 0 0.00 0 0 0 0.206 327.15
718 2.250 0.00 0.00 11.50 0.00 0.80 0.20 0 0 0 0.00 0 0 0 0.000 422.00
719 2.080 0.39 0.28 11.48 0.31 0.02 0.04 0 0 0 0.15 0 0 0 0.000 457.00
The Carbon and Chromium contents of these three points do not represent routine steel chemistry (for example, C above 2 together with Cr near 11.5, or Mn of 10.24 together with Cr of 8.19). They are unusual and can be removed.
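One common way to find such points is to combine leverage and Cook's distance. The sketch below uses synthetic data with one deliberately unusual observation. The cutoffs `4 / n` and `2p / n` are widely used rules of thumb and are assumptions here, since the report does not state which thresholds were applied.

```r
set.seed(42)
n <- 100
x <- runif(n, 0, 2)
y <- 750 - 250 * x + rnorm(n, sd = 30)

# Append one unusual observation: extreme x and a response far from the trend.
toy <- data.frame(x = c(x, 10), y = c(y, 400))

fit <- glm(y ~ x, family = gaussian, data = toy)

cooks <- cooks.distance(fit)  # influence of each point on the fitted coefficients
lev   <- hatvalues(fit)       # leverage of each point in predictor space

# Rule-of-thumb cutoffs (assumed, not taken from the report).
cook_cut <- 4 / nrow(toy)
lev_cut  <- 2 * length(coef(fit)) / nrow(toy)

flagged <- which(cooks > cook_cut & lev > lev_cut)
print(toy[flagged, ])
```

Flagged rows still need a subject-matter review, as done above, before deciding whether to drop them.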
Remove Influential Points and Updating Models
Call:
glm(formula = Ms ~ C * Mn + C * Ni + Si + Cr, family = gaussian,
data = filtered_dataset_no_influential)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 753.7295 3.2150 234.445 < 2e-16 ***
C -242.2463 5.5389 -43.735 < 2e-16 ***
Mn -16.9666 2.6015 -6.522 9.58e-11 ***
Ni -13.2149 0.3142 -42.053 < 2e-16 ***
Si -15.0063 2.5997 -5.772 9.55e-09 ***
Cr -9.1535 0.4711 -19.429 < 2e-16 ***
C:Mn -40.5155 5.9696 -6.787 1.67e-11 ***
C:Ni -12.4910 1.8647 -6.699 3.01e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1481.353)
Null deviance: 11670630 on 1453 degrees of freedom
Residual deviance: 2142037 on 1446 degrees of freedom
AIC: 14751
Number of Fisher Scoring iterations: 2
Call:
glm(formula = log(Ms) ~ C * Mn + C * Ni + Si + Cr, family = gaussian,
data = filtered_dataset_no_influential)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6636794 0.0055789 1194.433 < 2e-16 ***
C -0.4350978 0.0096117 -45.267 < 2e-16 ***
Mn -0.0349821 0.0045145 -7.749 1.74e-14 ***
Ni -0.0242615 0.0005453 -44.491 < 2e-16 ***
Si -0.0243802 0.0045113 -5.404 7.60e-08 ***
Cr -0.0161918 0.0008175 -19.806 < 2e-16 ***
C:Mn -0.0707458 0.0103590 -6.829 1.25e-11 ***
C:Ni -0.0225066 0.0032359 -6.955 5.31e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.004460798)
Null deviance: 37.1057 on 1453 degrees of freedom
Residual deviance: 6.4503 on 1446 degrees of freedom
AIC: -3733.4
Number of Fisher Scoring iterations: 2
Investigate Points Found Out by Cook’s Distances
C Mn Si Cr Ni Mo V Co Al W Cu Nb Ti B N Ms
1199 1.36 1.84 1.14 0.15 1.81 1.41 0.00 0 0 0.00 0.00 0 0 0 0 439
1259 2.08 0.39 0.28 11.48 0.31 0.02 0.04 0 0 0.00 0.15 0 0 0 0 361
1122 1.56 0.37 0.20 12.46 0.26 0.54 0.65 0 0 0.28 0.10 0 0 0 0 458
These observations are valid and will be kept in the dataset.
Apply the Residuals Check to the Filtered Dataset
Counting Outliers
outlier n
1 Not Suspected 1429
2 Suspected 25
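The report does not show the rule used to label a point as "Suspected". As one possible sketch, the code below flags observations whose standardized residual exceeds 3 in absolute value. The threshold and the synthetic data are assumptions for illustration only.

```r
set.seed(7)
n <- 300
x <- runif(n)
y <- 700 - 250 * x + rnorm(n, sd = 25)

# Inject three gross errors sitting 250 units above the true trend.
y[1:3] <- 700 - 250 * x[1:3] + 250
toy <- data.frame(x, y)

fit <- glm(y ~ x, family = gaussian, data = toy)

# Standardized residuals; |value| > 3 is an assumed cutoff.
std_res <- rstandard(fit)
toy$outlier <- ifelse(abs(std_res) > 3, "Suspected", "Not Suspected")
print(table(toy$outlier))

# Dataset with the suspected outliers removed.
toy_no_outlier <- toy[toy$outlier == "Not Suspected", ]
print(nrow(toy_no_outlier))
```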
Plotting Carbon Content with Martensite Starting Temperature
Performing ANOVA for Both Models After Removing Influential Points
Check Multicollinearity
First Model
C Mn Ni Si Cr C:Mn C:Ni
2.392626 2.309298 1.417059 1.067519 1.073750 2.869351 1.194094
Log Model
C Mn Ni Si Cr C:Mn C:Ni
2.392626 2.309298 1.417059 1.067519 1.073750 2.869351 1.194094
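Neither model shows a concerning level of multicollinearity. The values are identical for the two models because the variance inflation factors depend only on the predictors, not on the response.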
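To illustrate why the two rows of values coincide, the sketch below computes variance inflation factors by hand from the model matrix, regressing each column on the others. The helper `vif_by_hand` and the synthetic data are hypothetical, and the report's own values may have been produced with a package function instead.

```r
set.seed(3)
n <- 500
toy <- data.frame(
  C  = runif(n, 0.05, 1.2),
  Mn = runif(n, 0.2, 2),
  Si = runif(n, 0, 1)
)
toy$Ms <- 770 - 280 * toy$C - 16 * toy$Mn - 14 * toy$Si -
  40 * toy$C * toy$Mn + rnorm(n, sd = 28)  # synthetic response

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others.
vif_by_hand <- function(formula, data) {
  X <- model.matrix(formula, data)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vif_raw <- vif_by_hand(Ms ~ C * Mn + Si, toy)
vif_log <- vif_by_hand(log(Ms) ~ C * Mn + Si, toy)

print(round(vif_raw, 3))
print(round(vif_log, 3))
```

Only the right-hand side of the formula enters the calculation, so transforming the response leaves the result unchanged.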
Remove Identified Outliers and Revise Both Models
Call:
glm(formula = Ms ~ C * Mn + C * Ni + Si + Cr, family = gaussian,
data = filtered_dataset_no_influential_no_outlier)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 769.4121 2.6216 293.489 < 2e-16 ***
C -286.7145 4.5296 -63.299 < 2e-16 ***
Mn -16.4197 2.1971 -7.473 1.36e-13 ***
Ni -14.0369 0.2361 -59.454 < 2e-16 ***
Si -13.8905 1.8663 -7.443 1.70e-13 ***
Cr -10.1289 0.3825 -26.478 < 2e-16 ***
C:Mn -41.4509 4.7323 -8.759 < 2e-16 ***
C:Ni -8.3589 1.3578 -6.156 9.68e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 760.2129)
Null deviance: 10969684 on 1428 degrees of freedom
Residual deviance: 1080263 on 1421 degrees of freedom
AIC: 13545
Number of Fisher Scoring iterations: 2
Performing ANOVA for the First Model After Removing Outliers
Checking Cook’s Distance After Removing Outliers
Log Model After Removing Outliers
Call:
glm(formula = log(Ms) ~ C * Mn + C * Ni + Si + Cr, family = gaussian,
data = filtered_dataset_no_influential_no_outlier)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6889963 0.0043407 1540.990 < 2e-16 ***
C -0.5102408 0.0074998 -68.034 < 2e-16 ***
Mn -0.0319981 0.0036379 -8.796 < 2e-16 ***
Ni -0.0255305 0.0003909 -65.309 < 2e-16 ***
Si -0.0225859 0.0030902 -7.309 4.48e-13 ***
Cr -0.0175028 0.0006334 -27.633 < 2e-16 ***
C:Mn -0.0751043 0.0078355 -9.585 < 2e-16 ***
C:Ni -0.0154270 0.0022482 -6.862 1.01e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.002084123)
Null deviance: 34.4909 on 1428 degrees of freedom
Residual deviance: 2.9615 on 1421 degrees of freedom
AIC: -4756.5
Number of Fisher Scoring iterations: 2
Performing ANOVA for The Log Model After Removing Outliers
Checking Cook’s Distance
Table 1 (Revised Dataset)
| Variable | Min | Max | Mean | Median | SD |
|---|---|---|---|---|---|
| Ms (Martensite Start Temp) | 310.00 | 784.00 | 601.77 | 605.00 | 87.65 |
| C (Carbon) | 0.0016 | 1.46 | 0.36 | 0.33 | 0.26 |
| Mn (Manganese) | 0.00 | 4.95 | 0.79 | 0.69 | 0.56 |
| Ni (Nickel) | 0.00 | 27.20 | 1.56 | 0.15 | 3.82 |
| Si (Silicon) | 0.00 | 3.80 | 0.35 | 0.26 | 0.40 |
| Cr (Chromium) | 0.00 | 16.20 | 1.04 | 0.52 | 1.99 |
Updated Summary for the Revised Dataset
Ms C Mn Ni
Min. :310.0 Min. :0.0016 Min. :0.0000 Min. : 0.000
1st Qu.:553.5 1st Qu.:0.1600 1st Qu.:0.4700 1st Qu.: 0.000
Median :605.0 Median :0.3300 Median :0.6900 Median : 0.150
Mean :601.8 Mean :0.3617 Mean :0.7917 Mean : 1.558
3rd Qu.:670.0 3rd Qu.:0.4400 3rd Qu.:0.9700 3rd Qu.: 1.580
Max. :784.0 Max. :1.4600 Max. :4.9500 Max. :27.200
Si Cr
Min. :0.0000 Min. : 0.000
1st Qu.:0.2000 1st Qu.: 0.000
Median :0.2600 Median : 0.520
Mean :0.3475 Mean : 1.043
3rd Qu.:0.3400 3rd Qu.: 1.150
Max. :3.8000 Max. :16.200
vars n mean sd median trimmed mad min max range skew kurtosis
Ms 1 1429 601.77 87.65 605.00 606.84 85.99 310 784.00 474.00 -0.49 -0.07
C 2 1429 0.36 0.26 0.33 0.33 0.22 0 1.46 1.46 1.27 1.79
Mn 3 1429 0.79 0.56 0.69 0.74 0.34 0 4.95 4.95 2.15 10.15
Ni 4 1429 1.56 3.82 0.15 0.70 0.22 0 27.20 27.20 4.37 20.23
Si 5 1429 0.35 0.40 0.26 0.26 0.10 0 3.80 3.80 3.26 15.13
Cr 6 1429 1.04 1.99 0.52 0.60 0.77 0 16.20 16.20 4.18 20.19
se
Ms 2.32
C 0.01
Mn 0.01
Ni 0.10
Si 0.01
Cr 0.05
Histograms for Key Variables in the Revised Dataset
Table 2 Untransformed Model
| Variables | Mean ± SD | Coefficient Estimate | P-value |
|---|---|---|---|
| C | 0.36 ± 0.26 | -286.71 | < 2e-16 |
| Mn | 0.79 ± 0.56 | -16.42 | 1.36E-13 |
| Ni | 1.56 ± 3.82 | -14.04 | < 2e-16 |
| Si | 0.35 ± 0.40 | -13.89 | 1.70E-13 |
| Cr | 1.04 ± 1.99 | -10.13 | < 2e-16 |
| C:Mn | N/A | -41.45 | < 2e-16 |
| C:Ni | N/A | -8.36 | 9.68E-10 |
The model is:
\(Ms = 769.41 - 286.71\,C - 16.42\,Mn - 14.04\,Ni - 13.89\,Si - 10.13\,Cr - 41.45\,(C \times Mn) - 8.36\,(C \times Ni)\)
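As a worked example, the fitted equation can be wrapped in a small function and applied to the first row of the dataset head (C = 0.19, Mn = 1.25, Ni = 0.66, Si = 0.23, Cr = 0.07, observed Ms = 645). The function name `predict_ms` is hypothetical, and the coefficients are the rounded values from the equation above.

```r
# Untransformed model with rounded coefficients from the fitted equation.
predict_ms <- function(C, Mn, Ni, Si, Cr) {
  769.41 - 286.71 * C - 16.42 * Mn - 14.04 * Ni - 13.89 * Si - 10.13 * Cr -
    41.45 * C * Mn - 8.36 * C * Ni
}

# First row of the dataset head.
pred <- predict_ms(C = 0.19, Mn = 1.25, Ni = 0.66, Si = 0.23, Cr = 0.07)
print(round(pred, 2))  # about 670.35

# Residual against the observed value of 645.
residual <- 645 - pred
print(round(residual, 2))
```

For this steel the model predicts about 25 units above the measured value, which is of the same order as the residual standard deviation implied by the dispersion parameter (√760 ≈ 27.6).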
Interpretation of the Coefficients
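Each coefficient in this model gives the change in Ms for a one-unit increase in that variable while the other variables are held constant. Because C interacts with Mn and Ni, the main-effect coefficients for C, Mn, and Ni describe the effect when the interacting element is at zero; at other levels the interaction terms modify the effect.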
Intercept:
- The intercept represents the predicted value of Ms when all predictors (C, Mn, Ni, Si, Cr) are at zero. This may not have a meaningful physical interpretation since, in practice, Ms is not typically defined when all elements are zero. However, it provides a baseline for the model.
Main Effects:
Carbon (C): The coefficient for C (-286.71) is the expected change in Ms for a one-unit increase in Carbon content, holding all other elements constant. The negative sign indicates that higher Carbon content reduces the Martensite start temperature, which aligns with metallurgical theory, as Carbon stabilizes the austenite phase, reducing Ms.
Manganese (Mn): The coefficient for Mn (-16.42) shows how Ms changes with an increase in Manganese. The negative sign suggests that Mn lowers Ms, likely due to its effect on stabilizing austenite.
Nickel (Ni): The coefficient for Ni (-14.04) indicates the effect of Nickel on Ms. Nickel also stabilizes austenite, so the negative coefficient aligns with its known effect of lowering Ms.
Silicon (Si): The coefficient for Si (-13.89) represents how changes in Silicon content affect Ms. Silicon is often expected to raise Ms because it promotes ferrite formation, but the fitted coefficient is negative, so in this dataset higher Silicon content is associated with a lower Ms.
Chromium (Cr): The coefficient for Cr (-10.13) reflects Chromium’s effect on Ms. Chromium typically lowers Ms as it also stabilizes the austenite phase, and the negative coefficient is consistent with this.
Interaction Parameters:
C and Mn Interaction: This term (-41.45) captures the combined effect of Carbon and Manganese on Ms. It is significant, which means that the effect of Carbon on Ms depends on the level of Manganese, and vice versa. The negative sign implies that as both C and Mn increase together, they reduce Ms more than the sum of their individual effects.
C and Ni Interaction: Similarly, this term (-8.36) captures the interaction between Carbon and Nickel. The negative coefficient indicates that higher levels of both Carbon and Nickel together have an additional effect in lowering Ms, beyond their individual effects.
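Because Carbon appears in both interaction terms, its effect on Ms in the untransformed model depends on Mn and Ni. As a worked illustration using the fitted coefficients, evaluated at the mean Mn and Ni of the revised dataset (an illustrative choice of operating point):

\[\frac{\partial Ms}{\partial C} = -286.71 - 41.45\,Mn - 8.36\,Ni\]

\[\left.\frac{\partial Ms}{\partial C}\right|_{Mn = 0.79,\ Ni = 1.56} = -286.71 - 41.45(0.79) - 8.36(1.56) \approx -332.5\]

At that composition, an increase of 0.1 in C would lower the predicted Ms by about 33 units, compared with about 29 units when Mn and Ni are both zero.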
Table 3 Log-Transformed Model
| Variables | Mean ± SD | Coefficient Estimate | P-value |
|---|---|---|---|
| C | 0.36 ± 0.26 | -0.51 | < 2e-16 |
| Mn | 0.79 ± 0.56 | -0.032 | < 2e-16 |
| Ni | 1.56 ± 3.82 | -0.0255 | < 2e-16 |
| Si | 0.35 ± 0.40 | -0.0226 | 4.48E-13 |
| Cr | 1.04 ± 1.99 | -0.0175 | < 2e-16 |
| C:Mn | N/A | -0.0751 | < 2e-16 |
| C:Ni | N/A | -0.0154 | 1.01E-11 |
The model is:
\(\log(Ms) = 6.69 - 0.51\,C - 0.032\,Mn - 0.0255\,Ni - 0.0226\,Si - 0.0175\,Cr - 0.0751\,(C \times Mn) - 0.0154\,(C \times Ni)\)
Interpretation of the Coefficients
Each coefficient in this model represents the proportional change in Ms for a unit change in each predictor, holding all other variables constant.
Intercept:
- The intercept represents the expected value of log(Ms) when all predictors are zero. Exponentiating this term gives the baseline value of Ms for a sample with zero values for C, Mn, Ni, Si, Cr. This value is hypothetical as these elements are rarely all zero in practice.
Main Effects:
Carbon (C): The coefficient for C, \(\beta_C\), represents the proportional change in Ms for each additional unit of Carbon. Since the response is log-transformed, a unit increase in C multiplies Ms by \(e^{\beta_C}\). If \(\beta_C\) is negative, it indicates that increasing Carbon reduces Ms, consistent with Carbon’s known effect of stabilizing austenite.
Manganese (Mn): The coefficient for Mn \(\beta_{Mn}\) shows the effect of Manganese on the log of Ms. A negative value indicates that Manganese decreases Ms in a multiplicative way, meaning each increase in Manganese content corresponds to a percentage decrease in Ms.
Nickel (Ni): The coefficient for Ni \(\beta_{Ni}\) indicates Nickel’s impact on the log of Ms. Like Manganese, a negative coefficient would imply that increasing Nickel decreases Ms proportionally. This aligns with Nickel’s effect as an austenite stabilizer.
Silicon (Si): The coefficient for Si \(\beta_{Si}\) captures Silicon’s impact on the log of Ms. Silicon is often expected to raise Ms by promoting ferrite formation, which would give a positive coefficient. The fitted value (-0.0226) is negative, so in this model increasing Silicon lowers Ms proportionally.
Chromium (Cr): The coefficient for Cr \(\beta_{Cr}\) represents Chromium’s effect on the log of Ms. Chromium generally lowers Ms due to its role in stabilizing austenite. A negative value here would confirm this behavior.
Interaction Parameters:
Interaction (\(\beta_{C:Mn}\)): This term captures the combined effect of Carbon and Manganese on the log of Ms. If significant, it suggests that the effect of Carbon on Ms depends on the level of Manganese, and vice versa. A negative interaction term means that as both C and Mn increase, they together reduce Ms more than each would individually.
Interaction (\(\beta_{C:Ni}\)): Similarly, this term captures the interaction between Carbon and Nickel. A negative coefficient here would imply that higher levels of both Carbon and Nickel together have an added effect in reducing Ms beyond their individual effects.
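The multiplicative reading of the log-model coefficients can be made concrete by exponentiating them. The sketch below uses the main-effect estimates from the final log model summary. For C, Mn, and Ni the resulting percentages apply only when the interacting element is at zero, which is an important caveat given the C:Mn and C:Ni terms.

```r
# Main-effect estimates from the final log-transformed model.
log_coefs <- c(
  C  = -0.5102408,
  Mn = -0.0319981,
  Ni = -0.0255305,
  Si = -0.0225859,
  Cr = -0.0175028
)

# A one-unit increase multiplies Ms by exp(beta).
multipliers <- exp(log_coefs)

# Expressed as a percentage change in Ms per unit increase.
pct_change <- 100 * (multipliers - 1)
print(round(pct_change, 2))

# Baseline Ms implied by the intercept (all predictors at zero).
baseline <- exp(6.6889963)
print(round(baseline, 1))
```

Under this reading a full unit of Carbon multiplies Ms by roughly 0.60, and the hypothetical zero-alloy baseline is close to 800.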
Cross-Validation
Cross-Validation for the First Model
5-Fold
[1] 774.1075 772.0337
10-Fold
[1] 779.8235 778.5136
Interpretation
Consistency: Both 5-fold and 10-fold cross-validations yield fairly similar results, which indicates stability in the model’s predictive performance across different validation approaches.
Slight Increase with 10-fold: The error slightly increased with 10-fold cross-validation. This could be due to the smaller subsets in each fold, as 10-fold divides the data into smaller groups compared to 5-fold, potentially revealing more variability or minor overfitting effects.
Implications for Model Evaluation: These results suggest that while the untransformed model’s performance is stable, there is no significant improvement in prediction error when using more folds. However, if computational efficiency or consistency is a priority, 5-fold might be a suitable choice.
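The two numbers printed for each run resemble the `delta` vector returned by `cv.glm` in the `boot` package (a raw and an adjusted estimate of the mean squared prediction error), although the report does not show the call, so that is an assumption. To make the procedure explicit, here is a hand-written k-fold sketch on synthetic data. The helper `kfold_mse` is hypothetical and assumes the response column is named `Ms`.

```r
set.seed(11)
n <- 400
toy <- data.frame(C = runif(n, 0.05, 1.2), Ni = runif(n, 0, 5))
toy$Ms <- 770 - 280 * toy$C - 14 * toy$Ni + rnorm(n, sd = 28)  # synthetic response

# k-fold cross-validated mean squared error on the scale of Ms.
kfold_mse <- function(formula, data, k) {
  # Randomly assign each row to one of k folds of near-equal size.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    test <- fold == i
    fit <- glm(formula, family = gaussian, data = data[!test, ])
    pred <- predict(fit, newdata = data[test, ])
    sq_err[test] <- (data$Ms[test] - pred)^2
  }
  mean(sq_err)
}

mse_5  <- kfold_mse(Ms ~ C + Ni, toy, k = 5)
mse_10 <- kfold_mse(Ms ~ C + Ni, toy, k = 10)
print(c(mse_5, mse_10))
```

Because fold assignment is random, small differences between the 5-fold and 10-fold values are expected even when the model is unchanged.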
Cross-Validation for the Second Model (Log Model)
5-Fold
[1] 0.002146943 0.002138416
10-Fold
[1] 0.002144025 0.002140193
Interpretation
Stability Across Folds: Both the 5-fold and 10-fold cross-validation results for the log-transformed model are extremely close, with very little variation between the fold types. This suggests that the log-transformed model is highly stable and performs consistently across different subsets of the data.
Comparison with Untransformed Model:
The cross-validation errors for the log-transformed model (~0.0021) are mean squared errors of log(Ms), whereas those of the untransformed model (~774 to ~780) are mean squared errors of Ms itself. The two sets of values are therefore on different scales, and the much smaller number for the log-transformed model does not by itself show a better fit or better generalization. A like-for-like comparison requires back-transforming the log model’s predictions to the original Ms scale before computing the error.
Selection of the Log-Transformed Model: The log-transformed model was introduced to address the skewed residuals and the Q-Q plot deviations of the untransformed model, and its cross-validation error is consistent across folds. On these grounds it is taken as the preferable choice for predicting the Martensite start temperature.
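The two cross-validation errors are reported in different units (squared Ms versus squared log(Ms)), so a rough conversion helps put them side by side. The sketch below is a back-of-the-envelope approximation, not a replacement for cross-validation on back-transformed predictions. It assumes that a small error on the log scale is approximately a relative error, and it evaluates that relative error at the mean Ms of the revised dataset.

```r
# Reported 5-fold cross-validation errors (first value of each pair).
mse_raw <- 774.1075     # untransformed model, units of Ms^2
mse_log <- 0.002146943  # log model, units of log(Ms)^2
mean_ms <- 601.77       # mean Ms in the revised dataset

# Typical error of the untransformed model on the Ms scale.
rmse_raw <- sqrt(mse_raw)

# Typical log-scale error, read as an approximate relative error.
rel_err_log <- sqrt(mse_log)

# Approximate typical error of the log model on the Ms scale (assumption:
# relative error applied at the mean Ms).
approx_rmse_log <- rel_err_log * mean_ms

print(round(c(rmse_raw = rmse_raw, approx_rmse_log = approx_rmse_log), 2))
print(round(100 * rel_err_log, 2))  # approximate relative error in percent
```

Under these assumptions both values come out close to 28, so on this rough measure the two models have a similar typical error on the original scale.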
The Leave-One-Out Cross-Validation (LOOCV)
The Untransformed Model
[1] 772.7368 772.7308
The Log-Transformed Model
[1] 0.002125407 0.002125388
Model Comparison
Predictive Accuracy: The LOOCV error of the log-transformed model (0.0021) is numerically lower than that of the untransformed model (772.7), as are its residual deviance and AIC in the earlier steps. These quantities are computed on log(Ms) for one model and on Ms for the other, so the difference in magnitude largely reflects the change of scale rather than a direct gain in accuracy.
Practical Use: If the purpose of the model is interpretability or prediction on the original scale of Ms, the untransformed model remains relevant. However, the log-transformed model addresses the residual skew seen in the untransformed model and is the one selected here.
Conclusion
The log-transformed model shows a stable prediction error under LOOCV, consistent with the 5-fold and 10-fold results, which supports its choice as the preferred model.